Bayesianism: Main Ideas
data \(X\) is fixed, and the parameters \(\theta\) of our process \(P_{\theta}\) are random
inference relies on the idea of updating prior beliefs based on evidence from the data
probabilities are used to quantify uncertainty we have about parameters
\[ \underbrace{p(\theta|d)}_\text{posterior} = \underbrace{\frac{p(d|\theta)}{p(d)}}_\text{update} \times \underbrace{p(\theta)}_\text{prior} \]
Bayesians think parameters are random. Frequentists think intervals (data) are random.
Bayesians make probability statements about parameters. Frequentists make probability statements about intervals.
\[ P(A \mid B) = \frac{P(A\cap B)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]
Bayes’ Rule is a way to calculate conditional probabilities
\[ P(\text{covid} \mid +) = \frac{P(+ \mid \text{covid}) \cdot P(\text{covid})}{P(+)} \]
What is the probability of having covid given that you got a positive covid test?
\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})} \]
“How did my theory change after seeing the data?”
\[ \color{#D55E00}{P(\theta \mid \text{data})} = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})} \]
The Posterior is the probability of \(\theta\) after seeing the data.
\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot \color{#CC79A7}{P(\theta)}}{P(\text{data})} \]
The Prior is the probability distribution of \(\theta\) before seeing the data.
\[ P(\theta \mid \text{data}) = \frac{\color{#F5C710}{P(\text{data} \mid \theta)} \cdot P(\theta)}{P(\text{data})} \]
The Likelihood is the probability of our data, given \(\theta\), evaluated at different values of \(\theta\).
\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{\color{#009E73}{P(\text{data})}} \]
The Normalizing Constant is the probability of our data. It normalizes the posterior so that it’s a valid probability (distribution).
It does not matter.
The normalizing constant makes \(P(\theta \mid \text{data})\) a valid probability distribution (i.e. \(\int P(\theta \mid \text{data}) d \theta = 1\)) but, it’s just a scalar constant…so \(P(\text{data} \mid \theta) \cdot P(\theta) \propto P(\theta \mid \text{data})\) 👀
\[ \left[P(\text{data} \mid \theta) \cdot P(\theta) \right] \propto P(\theta \mid \text{data}) \]
\[ \text{likelihood} \cdot \text{prior} \propto \text{posterior} \]
we have a function \(f(\theta) = P(\text{data} \mid \theta) \cdot P(\theta)\) that is proportional to the probability distribution \(p(\theta) = P(\theta \mid \text{data})\) we want to sample from, but \(f\) itself is not a proper probability distribution…
❓ What does that remind you of?
Note: If we have draws from our posterior distribution \(P(\theta \mid \text{data})\), we can use these draws to calculate any statistic we want: the mean, median, or quantiles of the draws, or of transformations of the draws.
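The proportionality above can be sketched with a grid approximation. The toy numbers here are assumed purely for illustration (3 successes in 10 Bernoulli trials, flat prior): normalizing likelihood × prior by its total mass recovers a valid distribution.

```python
# Grid sketch of likelihood * prior being proportional to the posterior.
# Toy example (assumed): 3 successes in 10 Bernoulli trials, flat prior.
thetas = [i / 1000 for i in range(1, 1000)]  # grid over theta in (0, 1)
unnorm = [t**3 * (1 - t)**7 for t in thetas]  # likelihood * flat prior
Z = sum(unnorm)                               # plays the role of P(data) on the grid
posterior = [u / Z for u in unnorm]           # now sums to 1

mode = thetas[max(range(len(posterior)), key=posterior.__getitem__)]
print(round(sum(posterior), 6))  # 1.0 -> a valid probability distribution
print(mode)                      # 0.3, as expected with a flat prior (3/10)
```

The unnormalized curve and the posterior have exactly the same shape; only the scale differs, which is why the normalizing constant "does not matter" for finding where the posterior puts its mass.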
\[ P(\text{flu} \mid \text{+}) = \frac{P(\text{+} \mid \text{flu}) \cdot P(\text{flu})}{{P(\text{+})}} \]
\(P(\text{flu}) = 0.05\) (prevalence of flu)
\(P(\text{+} \mid \text{flu}) = 0.99\) (sensitivity of test)
\(P(\text{+} \mid \text{no flu}) = 0.1\) (1- specificity of test)
\(P(\text{+}) = \underbrace{P(\text{+} \mid \text{flu})\cdot P(\text{flu})}_\text{way 1}+ \underbrace{P(\text{+} \mid \text{no flu})\cdot P(\text{no flu})}_\text{way 2}\)
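Plugging the slide’s numbers into Bayes’ rule, a quick numeric check:

```python
# Flu-test numbers from the slide above.
p_flu = 0.05               # P(flu): prevalence
p_pos_given_flu = 0.99     # P(+ | flu): sensitivity
p_pos_given_no_flu = 0.10  # P(+ | no flu): 1 - specificity

# Law of total probability: P(+) = way 1 + way 2
p_pos = p_pos_given_flu * p_flu + p_pos_given_no_flu * (1 - p_flu)

# Bayes' rule: P(flu | +)
p_flu_given_pos = p_pos_given_flu * p_flu / p_pos

print(round(p_pos, 4))            # 0.1445
print(round(p_flu_given_pos, 3))  # 0.343
```

Despite the test’s 99% sensitivity, a positive result only raises the probability of flu to about 34%, because the prior (the 5% prevalence) is low.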
We’re interested in estimating \(q\) the proportion of days it rains in California. It rained 12 of the last 365 days.
Binomial Likelihood: \(\mathcal{L}(q \mid x) = \binom{n}{x} q^{x} (1 - q)^{n - x}\)
Beta Prior: \(q \sim \text{Beta}(\alpha, \beta)= \frac{q^{\alpha-1} (1-q)^{\beta-1}}{B(\alpha,\beta)}\)
Beta Prior Shiny App
Play around with the app for a minute changing alpha and beta until you find a prior that looks reasonable to you.
We’re interested in estimating \(q\) the proportion of days it rains in California. It rained 12 of the last 365 days (\(x\)).
Binomial Likelihood: \(\mathcal{L}(q \mid x) = \binom{365}{12} q^{12} (1 - q)^{365-12}\)
Beta Prior: \(q \sim \text{Beta}(1, 9) = \frac{q^{1-1} (1-q)^{9-1}}{B(1,9)}\)
Remember: \(P(q \mid x) \underbrace{\propto \binom{365}{12} q^{12} (1 - q)^{365-12}}_\text{likelihood} \times \underbrace{\frac{q^{1-1} (1-q)^{9-1}}{B(1,9)}}_\text{prior}\)
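Because the Beta prior is conjugate to the binomial likelihood, multiplying the two kernels above gives a closed-form posterior: \(q \mid x \sim \text{Beta}(\alpha + x,\; \beta + n - x)\). A quick check with the slide’s numbers:

```python
# Conjugate update: Beta(alpha, beta) prior + Binomial(n, x) data
# -> Beta(alpha + x, beta + n - x) posterior.
alpha, beta = 1, 9   # prior from the slide
n, x = 365, 12       # 12 rainy days out of 365

post_alpha = alpha + x      # 13
post_beta = beta + n - x    # 362
post_mean = post_alpha / (post_alpha + post_beta)

print(post_alpha, post_beta)  # 13 362
print(round(post_mean, 4))    # 0.0347
```

The posterior mean (about 3.5% of days) sits between the raw data proportion 12/365 ≈ 3.3% and the prior mean 1/10 = 10%, pulled strongly toward the data because \(n\) is large relative to the prior’s pseudo-counts.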
Posteriors are inherently quantifications of uncertainty. They use a probability distribution to tell us the relative likelihood of different possible values of our parameter \(\theta\).
However, as great as they are, posterior distributions (and draws from a posterior) carry too much raw information to communicate on their own.
Imagine handing your boss a posterior distribution in response to the question “how much more effective is blue font compared to black font”?
So, we still need summaries. In the Bayesian framework, we get point estimates by calculating statistics/summaries using our posterior.
E.g. the mean \(\mathbb{E}(p(\theta | x))\) of the posterior, or the median of the posterior.
In practice, we typically have samples from our posterior, \(p(\theta | x)\), not the distribution itself, but it’s actually easier to calculate summaries/statistics using samples! e.g.
\[ \frac{1}{n} \sum_{i=1}^n \underbrace{\theta_i}_\text{posterior sample of theta} \]
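For example, using draws for \(q\) from the rain example. The draws here are sampled directly from the conjugate Beta(13, 362) posterior for convenience; in practice they would come from your sampler.

```python
import random

random.seed(0)
# 10,000 posterior draws for q; Beta(13, 362) is the rain example's posterior
draws = sorted(random.betavariate(13, 362) for _ in range(10_000))

post_mean = sum(draws) / len(draws)    # Monte Carlo estimate of E[q | x]
post_median = draws[len(draws) // 2]   # 50th percentile of the draws
q90 = draws[int(0.90 * len(draws))]    # 90th percentile of the draws
```

Each summary is just ordinary arithmetic on the draws; no integral over the posterior density is ever needed.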
Credible Intervals are ranges of values \((lb, ub)\) that satisfy \(p(lb \leq \theta \leq ub) = c\), where \(c\) is a probability like 50%, 90%, 95%, etc.
literally that’s it…
Credible Interval (Bayesian): An interval within which the parameter \(\theta\) lies with a certain probability, given the observed data and the priors chosen.
Confidence Interval (Frequentist): An interval constructed so that under repeated sampling, a certain proportion of such intervals will contain the true parameter value \(\theta\).
Equal Tailed Interval: choose an interval with \(c\)% of the mass of the posterior, leaving \(\frac{1-c}{2}\) of the mass in each of the upper and lower tails.
\[ P(\theta \leq lb \mid x) = \frac{1-c}{2} \\ P(\theta \geq ub \mid x) = \frac{1-c}{2} \]
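With posterior draws, the ETI is just two quantiles. A sketch, using illustrative draws from an assumed skewed (Gamma-shaped) posterior:

```python
import random

random.seed(1)
# Illustrative draws from a skewed posterior (Gamma-shaped, assumed)
draws = sorted(random.gammavariate(2.0, 1.0) for _ in range(10_000))

c = 0.90
n = len(draws)
lb = draws[int((1 - c) / 2 * n)]  # 5th percentile: P(theta <= lb | x) = 0.05
ub = draws[int((1 + c) / 2 * n)]  # 95th percentile: P(theta >= ub | x) = 0.05
```

By construction, 90% of the draws fall between `lb` and `ub`, with 5% cut off in each tail.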
This looks a little different when you have a skewed distribution.
What seems off here?
Highest Density Interval (HDI): choose an interval where the density \(p(\theta \mid x)\) is higher inside the interval than outside it, and that contains a specified probability mass.
For all \(\theta \in [lb, ub]\) and \(\theta' \notin [lb,ub]\), \(p(\theta \mid x) \geq p(\theta' \mid x)\)
Unlike the ETI, the HDI is not constrained to have equal tails.
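A common sample-based sketch of the HDI: among all windows that contain \(c\)% of the sorted draws, keep the narrowest one (illustrative skewed Gamma-shaped draws, assumed):

```python
import random

random.seed(2)
# Illustrative draws from a skewed posterior (Gamma-shaped, assumed)
draws = sorted(random.gammavariate(2.0, 1.0) for _ in range(10_000))

c = 0.90
n = len(draws)
k = int(c * n)  # number of draws the interval must contain
# width of every window [draws[i], draws[i + k]] holding c% of the draws
widths = [draws[i + k] - draws[i] for i in range(n - k)]
i_best = min(range(len(widths)), key=widths.__getitem__)
hdi = (draws[i_best], draws[i_best + k])  # narrowest such window
```

For a right-skewed posterior like this one, the HDI shifts toward the mode and its lower endpoint sits well below the ETI’s, while its upper endpoint is smaller.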
❓What would an ETI look like here? Any issues with that? Any issues with this HDI?
ETI: easy to calculate, splits the excluded values from the posterior equally between extreme highs and extreme lows, most similar to frequentist CIs
HDI: always contains the mode(s) of the posterior, can account for asymmetry, and will be the narrowest interval for a given confidence level \(c\)%
The nice thing about Posteriors is that it’s easy to calculate any summary (point or interval) of the posterior that’s of interest to you.
We have the posterior for \(d\), the difference between the mean height of blondes and the mean height of brunettes (\(\mu_{bl} - \mu_{br}\)). What’s the probability that \(d < 0\)?
We have the posterior for \(p\), the payout you might get from the 1,000 lottery tickets you just bought. What would you win in the top 10% of scenarios?
We have the posterior for \(t\), the amount of tips you expect to get from your hairdressing job. You only like going to work if you think you’ll make more than $200 in tips. What’s the probability that \(t \geq 200\)?
We have the posterior for \(c\), the click-rate of your new email campaign. We want to know whether it’s equivalent to the current campaign that has a click rate of 0.02. What’s the probability that \(0.01 \leq c \leq 0.03\) (which we’ve defined is practically equivalent)?
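Each of these questions is answered the same way: compute the fraction (or quantile) of posterior draws satisfying the event. A sketch for the tips question, where the Normal(220, 30) posterior is assumed purely for illustration:

```python
import random

random.seed(3)
# Assumed posterior draws for tips t (illustration only)
t_draws = [random.gauss(220, 30) for _ in range(10_000)]

# P(t >= 200 | data): fraction of draws at or above $200
p_work = sum(t >= 200 for t in t_draws) / len(t_draws)

# "top 10% of scenarios" style question: the 90th percentile of the draws
t_sorted = sorted(t_draws)
t90 = t_sorted[int(0.90 * len(t_sorted))]
```

The same one-liner pattern handles \(P(d < 0)\), \(P(t \geq 200)\), or \(P(0.01 \leq c \leq 0.03)\); only the event inside the `sum(...)` changes.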
Region of Practical Equivalence: an interval/range of values that are practically equivalent to no effect.
Smallest Effect Size of Interest: the smallest effect size that would be meaningful, clinically relevant, or impactful.
Define a ROPE (use domain expertise or a “standard” small value like \(\frac{1}{10} sd\) )
Calculate what % of your Posterior CI overlaps with ROPE
If there is a lot of overlap, evidence for practical equivalence. If little overlap, evidence for non-equivalence.
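The three steps above can be sketched for the click-rate example. The ROPE \([0.01, 0.03]\) is the one defined earlier; the Beta posterior for \(c\) is assumed purely for illustration:

```python
import random

random.seed(4)
# Assumed posterior draws for the click rate c (illustration only)
c_draws = sorted(random.betavariate(25, 1100) for _ in range(10_000))

# Step 2a: 95% equal-tailed credible interval from the draws
n = len(c_draws)
lb = c_draws[int(0.025 * n)]
ub = c_draws[int(0.975 * n)]

# Step 2b: % of the CI's draws falling inside the ROPE [0.01, 0.03]
rope_lo, rope_hi = 0.01, 0.03
ci_draws = [c for c in c_draws if lb <= c <= ub]
overlap = sum(rope_lo <= c <= rope_hi for c in ci_draws) / len(ci_draws)
# Step 3: overlap near 1 -> evidence for practical equivalence
```

Here nearly all of the credible interval sits inside the ROPE, so the new campaign’s click rate is, by our definition, practically equivalent to the current one.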